Quantifying Counts, Costs, and Trends Accurately via Machine Learning

Author

  • George Forman
Abstract

In many business and science applications, it is important to track trends over historical data, for example, measuring the monthly prevalence of influenza incidents at a hospital. In situations where a machine learning classifier is needed to identify the relevant incidents from among all cases in the database, anything less than perfect classification accuracy will result in a consistent and potentially substantial bias in estimating the class prevalence. There is an assumption ubiquitous in machine learning that the class distribution of the training set matches that of the test set, but this is certainly not the case for applications where the goal is to measure changes or trends in the distribution over time. The paper defines two research challenges for machine learning that address this distribution mismatch problem. The ‘quantification’ task is to accurately estimate the number of positive cases (or class distribution) in an unlabeled test set via machine learning, using a limited training set that may have a substantially different class distribution. The ‘cost quantification’ task is to estimate the total cost associated with the positive class, where each case is tagged with a cost attribute, such as the hours of labor needed to resolve the case. Obtaining a precise quantification estimate over a set of cases has a very different utility model from traditional classification research, whose goal is to obtain an accurate classification for each individual case. For both forms of quantification, the paper describes a suitable experiment methodology and evaluates a variety of methods. It reveals which methods give more reliable estimates, even when training data is scarce and the testing class distribution differs widely from training. Some methods function well even under high class imbalance, e.g. 1% positives. These strengths can make quantification practical for business use, even where classification accuracy is poor.
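As a rough illustration of the bias described above, the following sketch compares a naive "classify and count" estimate with a corrected one. It is a minimal example under stated assumptions, not necessarily one of the methods the paper evaluates. The function names, the example rates (true positive rate 0.8, false positive rate 0.1), and the flat 2.0 hours per case are hypothetical. The correction inverts the identity observed = p * tpr + (1 - p) * fpr, and it assumes tpr and fpr were estimated on the training set (for example by cross-validation) and still hold on the test set. The cost step, which multiplies the average cost of predicted positives by the estimated count, is likewise only a simple assumption.

```python
def classify_and_count(predictions):
    """Naive estimate: fraction of cases the classifier labels positive."""
    return sum(predictions) / len(predictions)


def adjusted_count(observed_rate, tpr, fpr):
    """Correct the observed positive rate using the classifier's error rates.

    Solves observed = p * tpr + (1 - p) * fpr for p and clips to [0, 1].
    tpr and fpr are assumed to come from the training data.
    """
    if tpr <= fpr:
        raise ValueError("correction needs tpr > fpr")
    estimate = (observed_rate - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, estimate))


def estimate_total_cost(predictions, costs, prevalence_estimate):
    """Hypothetical cost estimate: mean cost of predicted positives
    times the estimated number of positive cases."""
    positive_costs = [c for p, c in zip(predictions, costs) if p]
    if not positive_costs:
        return 0.0
    average_cost = sum(positive_costs) / len(positive_costs)
    return average_cost * prevalence_estimate * len(predictions)


# Hypothetical test set: 1000 cases, 100 truly positive (10% prevalence).
# With tpr = 0.8 and fpr = 0.1 the classifier flags 80 of the 100 positives
# and 90 of the 900 negatives.
predictions = [1] * 80 + [0] * 20 + [1] * 90 + [0] * 810
costs = [2.0] * len(predictions)  # assumed: every case costs 2.0 hours

naive = classify_and_count(predictions)
corrected = adjusted_count(naive, tpr=0.8, fpr=0.1)
total_cost = estimate_total_cost(predictions, costs, corrected)

print("naive prevalence:", naive)
print("corrected prevalence:", corrected)
print("estimated total cost (hours):", total_cost)
```

In this constructed case the naive estimate overstates the prevalence because false positives from the large negative class outnumber the missed positives, while the corrected estimate recovers the true 10% up to floating-point rounding.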

Similar resources

Hypertension Prediction in Primary School Students Using an Ensemble Machine Learning Method

Introduction: The prevalence of hypertension in children is increasing, and this complication is considered the most important risk factor for cardiovascular diseases in older age. Early detection and control of hypertension can prevent its progress and reduce its consequences. Machine learning methods can help predict this complication promptly and reduce cost and time. This study aimed to pro...

Intelligent Feature Selection for Opinion Classification

References 1. T. Macer, M. Pearson, and F. Sebastiani, “Cracking the Code: What Customers Say, in their own Words,” Proc. 50th Ann. Conf. Market Research Soc. (MRS 07), MRS, 2007. 2. D. Giorgetti and F. Sebastiani, “Automating Survey Coding by Multiclass Text Categorization Techniques,” J. Am. Soc. Information Science and Technology, vol. 54, no. 14, 2003, pp. 1269–1277. 3. G. Forman, “Quantify...

Two-stage fuzzy-stochastic programming for parallel machine scheduling problem with machine deterioration and operator learning effect

This paper deals with the determination of machine numbers and production schedules in manufacturing environments. In this line, a two-stage fuzzy stochastic programming model is discussed with fuzzy processing times where both deterioration and learning effects are evaluated simultaneously. The first stage focuses on the type and number of machines in order to minimize the total costs associat...

Machine learning based Visual Evoked Potential (VEP) Signals Recognition

Introduction: Visual evoked potentials contain diagnostic information that has proved important in assessing the functional integrity of the visual system. Because amplitude decreases substantially under extra-macular stimulation in commonly used pattern VEPs, differentiating normal from abnormal signals can be difficult. Due to developments of use of machine l...

Publication date: 2007